Single precision floating general matrix multiply optimization for machine translation based on ARMv8 architecture
GONG Mingqing, YE Huang, ZHANG Jian, LU Xingjing, CHEN Wei
Journal of Computer Applications    2019, 39 (6): 1557-1562.   DOI: 10.11772/j.issn.1001-9081.2018122608
Abstract
To address the inefficiency of neural network inference on mobile intelligent devices that use ARM processors, a set of optimizations for the Single precision floating GEneral Matrix Multiply (SGEMM) algorithm on the ARMv8 architecture was proposed. Firstly, it was determined that the computational efficiency of an ARMv8-based processor executing SGEMM is limited by three factors: how the vectorized computation units are used, the instruction pipeline, and the probability of cache misses. Secondly, three optimization techniques were implemented to address these three factors: inline assembly with vector instructions, data rearrangement, and data prefetching. Finally, experiments were designed around three matrix shapes commonly used in speech-domain neural networks, and the programs were run on the RK3399 hardware platform. The experimental results show that the single-core computing speed is 10.23 GFLOPS for square matrices, reaching 78.2% of the measured floating-point peak; 6.35 GFLOPS for slender matrices, reaching 48.1% of the measured peak; and 2.53 GFLOPS for continuous small matrices, reaching 19.2% of the measured peak. With the optimized SGEMM algorithm deployed in a speech recognition neural network program, the actual recognition speed of the program is significantly improved.
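To make two of the three techniques concrete, the following is a minimal sketch in portable C of data rearrangement (packing) and data prefetching around a small register-tiled micro-kernel. It is not the authors' implementation: the abstract describes inline assembly with ARMv8 vector instructions, whereas this sketch substitutes a plain C inner loop and leaves vectorization to the compiler. The tile sizes `MR` and `NR`, the prefetch distance `PF_DIST`, and all function names are illustrative assumptions. `__builtin_prefetch` is a GCC/Clang builtin and is assumed to be available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MR 4       /* micro-tile rows; illustrative, not from the paper */
#define NR 4       /* micro-tile columns; illustrative */
#define PF_DIST 8  /* prefetch distance in k-steps; a guess, tune per CPU */

/* Reference SGEMM: C = A * B, row-major; A is m x k, B is k x n. */
void sgemm_naive(int m, int n, int k, const float *A, const float *B, float *C) {
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[(size_t)i * k + p] * B[(size_t)p * n + j];
            C[(size_t)i * n + j] = acc;
        }
}

/* Data rearrangement: store MR rows of A so each k-step is contiguous. */
void pack_a_panel(int k, const float *A, int lda, float *pa) {
    for (int p = 0; p < k; ++p)
        for (int r = 0; r < MR; ++r)
            pa[p * MR + r] = A[(size_t)r * lda + p];
}

/* Data rearrangement: store B as n/NR panels, each k x NR and contiguous. */
void pack_b(int k, int n, const float *B, float *pb) {
    for (int j0 = 0; j0 < n; j0 += NR)
        for (int p = 0; p < k; ++p)
            for (int c = 0; c < NR; ++c)
                *pb++ = B[(size_t)p * n + j0 + c];
}

/* MR x NR micro-kernel over packed panels, with software prefetch. */
void micro_kernel(int k, const float *pa, const float *pb, float *C, int ldc) {
    float acc[MR][NR] = {{0.0f}};
    for (int p = 0; p < k; ++p) {
        if (p + PF_DIST < k) {  /* stay inside the packed buffers */
            __builtin_prefetch(pa + (p + PF_DIST) * MR, 0, 3);
            __builtin_prefetch(pb + (p + PF_DIST) * NR, 0, 3);
        }
        for (int r = 0; r < MR; ++r)
            for (int c = 0; c < NR; ++c)
                acc[r][c] += pa[p * MR + r] * pb[p * NR + c];
    }
    for (int r = 0; r < MR; ++r)
        for (int c = 0; c < NR; ++c)
            C[(size_t)r * ldc + c] = acc[r][c];
}

/* Packed SGEMM; returns 0 on success, -1 if allocation fails. */
int sgemm_packed(int m, int n, int k, const float *A, const float *B, float *C) {
    assert(m % MR == 0 && n % NR == 0);  /* sketch: no edge-tile handling */
    float *pa = malloc((size_t)k * MR * sizeof *pa);
    float *pb = malloc((size_t)k * n * sizeof *pb);
    if (!pa || !pb) { free(pa); free(pb); return -1; }
    pack_b(k, n, B, pb);
    for (int i0 = 0; i0 < m; i0 += MR) {
        pack_a_panel(k, A + (size_t)i0 * k, k, pa);
        for (int j0 = 0; j0 < n; j0 += NR)
            micro_kernel(k, pa, pb + (size_t)(j0 / NR) * k * NR,
                         C + (size_t)i0 * n + j0, n);
    }
    free(pa); free(pb);
    return 0;
}

/* Largest |naive - packed| entry on deterministic data; -1 on failure. */
float sgemm_max_abs_diff(int m, int n, int k) {
    float *A = malloc((size_t)m * k * sizeof *A);
    float *B = malloc((size_t)k * n * sizeof *B);
    float *C0 = malloc((size_t)m * n * sizeof *C0);
    float *C1 = malloc((size_t)m * n * sizeof *C1);
    float worst = -1.0f;
    if (A && B && C0 && C1) {
        for (int i = 0; i < m * k; ++i) A[i] = (float)(i % 7 - 3) * 0.5f;
        for (int i = 0; i < k * n; ++i) B[i] = (float)(i % 5 - 2) * 0.25f;
        sgemm_naive(m, n, k, A, B, C0);
        if (sgemm_packed(m, n, k, A, B, C1) == 0) {
            worst = 0.0f;
            for (int i = 0; i < m * n; ++i) {
                float d = C0[i] - C1[i];
                if (d < 0.0f) d = -d;
                if (d > worst) worst = d;
            }
        }
    }
    free(A); free(B); free(C0); free(C1);
    return worst;
}
```

Calling `sgemm_max_abs_diff(8, 12, 16)` compares the packed path against the naive reference on small inputs. Packing makes every access in the inner loop sequential, which is what allows the prefetch hints to target predictable addresses. A production kernel would additionally handle matrix sizes that are not multiples of the tile size and block the `k` dimension to fit the cache.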